clouds <- read.csv('data/clouds.csv')
str(clouds)
'data.frame': 20 obs. of 2 variables:
$ moisture : num 301 302 299 316 307 ...
$ treatment: chr "seeded" "seeded" "seeded" "seeded" ...
First steps
Julien Martin
University of Ottawa
April 29, 2024
introduce you to some basic statistics in R ✔️
focus on linear models ✔️
fit simple linear models in R ✔️
check linear model assumptions in R ✔️
many, many statistical tests available in R
range from the simple to the highly complex
many are included in standard base installation of R
you can extend the range of statistics by installing additional packages
does seeding clouds with dimethylsulphate alter the moisture content of clouds? (can we make it rain?)
10 random clouds were seeded and 10 random clouds unseeded
what’s the null hypothesis?
Two Sample t-test
data: clouds$moisture by clouds$treatment
t = 2.5404, df = 18, p-value = 0.02051
alternative hypothesis: true difference in means between group seeded and group unseeded is not equal to 0
95 percent confidence interval:
1.482679 15.657321
sample estimates:
mean in group seeded mean in group unseeded
303.63 295.06
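The header "Two Sample t-test" (rather than "Welch Two Sample t-test") and the `data:` line suggest the output above came from a call like the one sketched below. The data frame here is a simulated stand-in with made-up values, because `data/clouds.csv` is not reproduced in these slides, so the numbers it prints will differ from those shown.

```r
# Simulated stand-in for data/clouds.csv (hypothetical values, not the real data)
set.seed(42)
clouds <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7.5),   # 10 seeded clouds
                rnorm(10, mean = 295, sd = 7.5)),  # 10 unseeded clouds
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

# Classic two-sample t-test; var.equal = TRUE assumes equal group variances
clouds.t <- t.test(clouds$moisture ~ clouds$treatment, var.equal = TRUE)
clouds.t
```

With 10 clouds per group, the test has 10 + 10 - 2 = 18 degrees of freedom, matching the `df = 18` in the output.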
an alternative, but equivalent approach is to use a linear model to compare the means in each group
general linear models are often thought of as simple models, but they can be used to model a wide variety of data and experimental designs
traditionally, statistics is performed (and taught) like following a recipe book (ANOVA, t-test, ANCOVA, etc.)
general linear models provide a coherent and theoretically satisfying framework on which to conduct your analyses
t-test
ANOVA
factorial ANOVA
ANCOVA
linear regression
multiple regression
etc, etc
response variable ~ explanatory variable(s) + error
literally read as ‘variation in response variable modelled as a function of the explanatory variable(s) plus variation not explained by the explanatory variables’
it’s the attributes of the response and explanatory variables that determine the type of linear model fitted
fit the model with lm(): the response variable comes first, then the tilde ~, then the name of the explanatory variable
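A minimal sketch of that call for the cloud-seeding example. The object name `clouds.lm` is the one used later in these slides. The data frame is a simulated stand-in with made-up values, not the contents of `data/clouds.csv`.

```r
# Simulated stand-in for data/clouds.csv (hypothetical values, not the real data)
set.seed(42)
clouds <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7.5),
                rnorm(10, mean = 295, sd = 7.5)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

# response ~ explanatory variable; lm() treats the character column as a factor
clouds.lm <- lm(moisture ~ treatment, data = clouds)

# (Intercept) is the mean of the reference group ("seeded", first alphabetically);
# treatmentunseeded is the difference unseeded - seeded
coef(clouds.lm)
```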
display the ANOVA table with the anova() function
Analysis of Variance Table
Response: moisture
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 367.22 367.22 6.4538 0.02051 *
Residuals 18 1024.20 56.90
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
do you notice anything familiar about the p value?
(hint: see the output from the t-test we did earlier)
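The match is not a coincidence: for two groups, the F statistic from the linear model equals the square of the equal-variance t statistic, so the p-values are identical (in the output shown, 2.5404 squared is about 6.454). The sketch below checks this on simulated stand-in data with made-up values, not the real `data/clouds.csv`.

```r
# Simulated stand-in for data/clouds.csv (hypothetical values, not the real data)
set.seed(42)
clouds <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7.5),
                rnorm(10, mean = 295, sd = 7.5)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)

clouds.t   <- t.test(moisture ~ treatment, data = clouds, var.equal = TRUE)
clouds.lm  <- lm(moisture ~ treatment, data = clouds)
clouds.aov <- anova(clouds.lm)

# t squared equals F, and the two p-values agree
all.equal(unname(clouds.t$statistic)^2, clouds.aov$`F value`[1])
all.equal(clouds.t$p.value, clouds.aov$`Pr(>F)`[1])
```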
we have sufficient evidence to reject the null hypothesis (as before)
therefore, there is a significant difference in mean moisture content between seeded and unseeded clouds
do we accept this inference?
what about assumptions?
we could use Shapiro-Wilk and F tests as before
much better to assess visually by plotting the residuals
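For reference, the formal tests might look like the sketch below. Applying the Shapiro-Wilk test to the model residuals is one possible choice and an assumption here; the slides do not show how the tests were run earlier. The data are a simulated stand-in with made-up values.

```r
# Simulated stand-in for data/clouds.csv (hypothetical values, not the real data)
set.seed(42)
clouds <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7.5),
                rnorm(10, mean = 295, sd = 7.5)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)
clouds.lm <- lm(moisture ~ treatment, data = clouds)

# Shapiro-Wilk test of normality on the residuals
norm.test <- shapiro.test(residuals(clouds.lm))

# F test comparing the variances of the two groups
var.res <- var.test(moisture ~ treatment, data = clouds)

norm.test
var.res
```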
clouds.lm is a linear model object, so we can do things with it
we can use the plot() function directly to display residual plots
normality assumption
equal variance assumption
unusual or influential observations
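A minimal sketch of the residual plots, again on simulated stand-in data with made-up values rather than the real `data/clouds.csv`. The 2 x 2 layout set with `par()` is a convenience so that all four default diagnostic plots fit on one page.

```r
# Simulated stand-in for data/clouds.csv (hypothetical values, not the real data)
set.seed(42)
clouds <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7.5),
                rnorm(10, mean = 295, sd = 7.5)),
  treatment = rep(c("seeded", "unseeded"), each = 10)
)
clouds.lm <- lm(moisture ~ treatment, data = clouds)

# Arrange the four default diagnostic plots in a 2 x 2 grid
old.par <- par(mfrow = c(2, 2))
plot(clouds.lm)
par(old.par)  # restore the previous graphics settings
```

The Normal Q-Q plot addresses normality, the residuals vs fitted and scale-location plots address equal variance, and the last panel helps spot unusual or influential observations.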
| traditional name | model formula | R code |
|---|---|---|
| simple linear regression | Y ~ X1 (continuous) | lm(Y ~ X1) |
| one-way ANOVA | Y ~ X1 (categorical) | lm(Y ~ X1) |
| two-way ANOVA | Y ~ X1 (cat) + X2 (cat) | lm(Y ~ X1 + X2) |
| ANCOVA | Y ~ X1 (cat) + X2 (cont) | lm(Y ~ X1 + X2) |
| multiple regression | Y ~ X1 (cont) + X2 (cont) | lm(Y ~ X1 + X2) |
| factorial ANOVA | Y ~ X1 (cat) * X2 (cat) | lm(Y ~ X1 * X2) |
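In R's formula syntax, `+` adds main effects only, while `*` adds the main effects and their interaction, so `X1 * X2` is shorthand for `X1 + X2 + X1:X2`. The sketch below illustrates this with one categorical and one continuous explanatory variable. All variable names and values are simulated for illustration only.

```r
# Hypothetical simulated data: X1 categorical, X2 continuous
set.seed(1)
n  <- 40
X1 <- factor(rep(c("a", "b"), each = n / 2))
X2 <- runif(n, min = 0, max = 10)
Y  <- 5 + 2 * (X1 == "b") + 0.5 * X2 + rnorm(n)

fit.add <- lm(Y ~ X1 + X2)           # main effects only
fit.int <- lm(Y ~ X1 * X2)           # main effects plus interaction
fit.exp <- lm(Y ~ X1 + X2 + X1:X2)   # the same model as fit.int, written out

names(coef(fit.add))  # intercept, X1b, X2
names(coef(fit.int))  # intercept, X1b, X2, X1b:X2
```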
Credit: I borrowed slides from Alex Douglas.